Abstract

Building on the success of recent discriminative mid-level elements, we
propose a surprisingly simple approach for object detection which performs
comparably to the current state-of-the-art approaches on the PASCAL VOC comp3
detection challenge (no external data). Through extensive experiments and
ablation analysis, we show how our approach effectively improves upon
HOG-based pipelines by adding an intermediate mid-level representation for
the task of object detection. This representation is easily interpretable and
allows us to visualize what our object detector "sees". We also discuss the
insights our approach shares with CNN-based methods, such as the benefit of
sharing representations between categories.

1. Introduction 

How do we represent and recognize objects such as the dog or the car shown in
Figure 1? Until recent years, the most popular way to represent objects was
using low-level features such as HOG [11] or SIFT [37]. These low-level
features were then used to train classifiers such as SVMs or random forests.
Recently, several approaches have proposed discriminative mid-level visual
elements as an intermediate image representation between the low-level
features and the high-level semantic classes. While these approaches have
shown strong results for a variety of tasks such as indoor scene
classification [14], 3D scene understanding [24], video understanding [28]
and even visual prediction [54], relatively little effort has been devoted
toward adapting them for object detection (with the notable exception of
[18], which, while providing a first step towards object detection on
PASCAL [20], leaves room for quantitative improvement).

In this paper, we build upon a recently proposed mid-level representation
framework [14] and adapt it for the task of object detection. Even though our
mid-level representation uses a HOG-based pipeline, it still performs
comparably to convolutional neural networks (CNNs) [1, 25] on the comp3
detection challenge (no external data allowed). However, when compared to
other HOG-based approaches, it does provide a substantial boost. We believe
this boost is significant since it points out the importance of having a
mid-level representation in a recognition pipeline, and may guide research in
designing mid-level features and their application in object detection.

Figure 1. Left: Input image and a visualization of what our object detector
sees. Right: The average images of the mid-level elements which are most
useful for detecting objects in input images.

Why mid-level representation? Over the years there has been a lot of research
in low-level and high-level visual representation. Low-level representations
are susceptible to small variations in style and pose. On the other hand,
directly learning high-level representations requires millions of labeled
images of objects in all possible configurations, and it is difficult to
encode large intra-class variation. Therefore, what we need is a mid-level
representation in an object-detection pipeline: a representation that is more
adaptable to the appearance distributions in the real world than the
low-level features, but does not require the semantic grounding of the
high-level entities.

There have been efforts to include mid-level representations such as
poselets [6] and object-parts [18] but none of these approaches have given any
significant boost to latent SVM-based approaches [21]. On the other hand,
CNN-based approaches for object detection [25] have outperformed classic
object detection approaches [21]. We believe one of the reasons for better
performance of CNN-based approaches is the existence of the
discriminatively-trained mid-level representation, which in this case
consists of multiple layers of convolution. But these CNNs still require
millions of images to train the networks and therefore, in case of low data
availability (comp3 challenge in PASCAL [20]), they are still comparable to
existing approaches. In this paper, we want to explore the alternative
mid-level representation proposed in [14]. We explore how including this
mid-level representation can increase the performance of a classic HOG-based
pipeline.

Figure 2. Feature Representation: Given an image (top left), the region
proposals are first extracted (top center). The mid-level elements are
trained offline (bottom left), and then each region proposal is represented
by convolving mid-level elements over a HOG feature pyramid extracted from
the region (bottom center). The responses are max-pooled across different
scales in a spatial pyramid pattern to construct the final feature, which is
then fed into a linear SVM classifier. Refer to Section 3.2 for details.

Contributions: Our paper is one of the first papers to demonstrate how
discriminative mid-level elements [15, 45] can be used effectively for the
task of object detection. The goal of this paper is to analyze how mid-level
representations can boost the performance of a HOG-based pipeline.
Specifically, we show that "simple" HOG features have more power if a
"shallow" mid-level visual element representation is used in the HOG
pipeline. Using our approach, we achieve performance comparable to the
state-of-the-art on the PASCAL [20] comp3 object detection challenge. But
more importantly, we hope this paper will be able to rekindle the discussion
on mid-level representations and inspire more researchers to look at
mid-level elements as an important component in an object detection pipeline.

2. Related Work 

Over the past decade, object detection has been one of the most extensively
studied problems in computer vision. One of the early advancements in
statistical object detection came in 2005 when Dalal and Triggs [11]
introduced the histograms of oriented gradients (HOG) descriptor to represent
object templates and coupled it with an SVM. Consequently, much subsequent
work focused on exploiting the HOG+SVM strategy, in conjunction with
exhaustive sliding window search. The most successful have been deformable
parts-based models (DPM) [21]. DPM extended these HOG-based templates by
adding part templates and allowing deformation between them. The emergence of
DPM, and improvements in algorithms to train it, have led to a brisk increase
in performance on the PASCAL VOC object detection challenge. Later, numerous
works focused on improving the parts themselves, from using
strongly-supervised parts [6, 5, 19, 57] to using weak 3D supervision
[44, 47, 43, 40].

An alternate direction for improvement in performance was to incorporate
bottom-up segmentation priors for training DPMs [8, 23]. One such approach,
SegDPM [23], augmented HOG features with simple segmentation-based features
and respectably outperformed other DPM-style approaches. However, these
approaches have a fundamental limitation: given the complexity of exhaustive
search, they can only utilize simple features.

As a consequence, a major shift in detection paradigm was to bypass the need
for exhaustive search completely by generating category-independent
candidates for object location and scale [3, 17, 50, 7, 4, 9, 58, 30].
Commonly-used methods propose around 1,000 regions using fast segmentation
algorithms, which aim to discard many windows which are unlikely to contain
objects [56, 3, 50]. These object proposal methods have resulted in the use
of more sophisticated features [56, 23, 10, 51] and learning algorithms [53].
For example, [10, 51] use improved Fisher Vectors over SIFT [37] and color
descriptors; [52] uses color descriptors, feature encodings and spatial
poolings; and [53] uses multiple kernel learning on top of a variety of
appearance features with spatial pooling.

Figure 3. Most informative elements by category (positive: two elements
represent the category; negative: one element representing what it is not).
Each row (in both columns) depicts the top three training-set detections for
three of the most informative elements for one category. We measure an
element's informativeness by the weight of the respective dimension in the
category-level SVM's w vector. Left two sets depict the two most
positive-weighted elements; the right shows the most negative-weighted
elements. The positive-weighted elements were all mined from the positive
category (demonstrating the utility of discriminative training), and the
negative ones often depict patterns easily confused with the category, or
from objects that commonly appear in that category's context.

Concurrently, researchers have studied another important class of features
that are derived from CNNs [33], especially the formulation proposed by [31].
Recently, CNNs have consistently shown state-of-the-art performance on image
classification, motivating a number of researchers to apply CNNs to the task
of object detection. One strategy has been to train similar networks directly
for object detection; for example, [49] poses object localization as a
regression problem, while [25, 1] train CNNs to directly classify region
proposals. The methods using CNN-based features in the region proposal
paradigm (e.g., R-CNN [25]) are currently the state-of-the-art on the PASCAL
VOC detection challenge by a comfortable margin.

Mid-level visual elements: Mid-level visual elements, or mid-level
discriminative patches, are similar to parts, but are generally not
constrained to a particular location in an object template [15, 45]. While
the locations of these discriminative patches within the dataset are
generally not known beforehand, they can still be identified by measuring (1)
how representative they are of a particular category, and (2) how informative
they are with respect to identifying whatever categories they represent.
Numerous works have shown strong performance on a wide variety of tasks,
including scene classification [45, 14, 29, 35, 48, 55], visual data
mining [15], video understanding [28], video-based prediction [54], 3D
geometry [24], and even unsupervised object discovery [45]. Particularly
relevant is the work of [18] applying mid-level elements to object detection;
though the results were promising, they were well below the canonical
HOG-based approaches [21]. The paradigm of using mid-level elements is
similar to object bank [34], with the key difference being that visual
elements often capture visual concepts of a smaller granularity, which makes
them more shareable across categories, and more robust to large changes in
object appearance.

In this paper, we propose a representation using HOG- 
based mid-level elements in the region proposal paradigm 
and achieve results comparable to the state-of-the-art on 
PASCAL VOC comp3 challenge (no external data). 

3. Object Detection Pipeline 

Our object detection pipeline is similar to the recent work of Girshick et
al. [25]. While their approach is built around CNN features, ours uses
HOG-based mid-level visual elements. Our detection pipeline has three basic
modules: (1) Region proposals: class-independent object hypotheses obtained
via exhaustively searching over multiple segmentations of a given image; (2)
Mid-level visual elements: a set of detectors, each of which corresponds to
some discriminative part of a category, and whose responses within a given
region proposal are aggregated into a representative feature for that
proposal; (3) Class-specific classifiers: a class-specific classifier is used
to classify whether a region proposal belongs to a particular class or not.
Post-processing is then applied on the output of these classifiers to avoid
overlapping detections and to improve localization.

Figure 4. Pooling Scheme: (1) 5-region pooling (1 × 1 and 2 × 2); (2)
7-region pooling (1 × 1, 1 × 3, and 3 × 1).

3.1. Region proposals 

Much recent work in computer vision has been devoted to proposing, within a
given image, a set of regions that might correspond to objects. The idea is
that these regions should provide high recall while minimizing the number of
regions that need to be considered. This reduces the computational complexity
during the detection stage, and biases the algorithm toward "object-like"
regions. Objectness [3], category-independent object proposals [17],
randomized Prim [38], selective search [50], constrained parametric
min-cuts [7], multi-scale combinatorial grouping [4], binarized normed
gradients [9], edge boxes [58], and geodesic proposals [30] all provide
different trade-offs of speed, recall, and the total number of object
proposals returned. Our approach is agnostic to the kind of region proposals
used. We extract about 2,000 region proposals per image using selective
search [50]. This allows us to make a fair comparison with SS-SPM [50] and
with R-CNN [25].

3.2. Mid-level visual element representation 

Given a region, the next major challenge is to build a representation of its
contents that can easily be classified as one of the object categories, or as
background. Many hand-tuned low-level representations exist (e.g. HOG [11]
and SIFT [37]), but these have limited invariance to the sort of deformations
seen in objects. More complex representations like bag of visual words [46]
and, more recently, improved Fisher vectors [41, 10], improve the invariance
to deformation by ignoring the spatial position of each visual word within
the region. However, the basic elements of these representations (e.g.,
SIFT [37]) have limited spatial extent and therefore capture relatively
simple concepts. Furthermore, these features are generally not tuned to be
discriminative with respect to the object categories of interest. On the
other hand, DPMs [21] have large parts which are trained discriminatively,
but are less flexible in other respects; for instance, it is more difficult
to share parts across different views of a given object category.

Representations based on mid-level discriminative patches
[45, 14, 15, 18, 28, 29, 35, 24, 48, 55, 54] have recently shown strong
performance for many vision tasks, especially scene classification. The idea
is to find patches which are frequent, i.e., they occur many times in the
category of interest; discriminative, i.e., easily recognizable; and
informative, in that they occur in only one of the categories. Detectors for
these patches are commonly implemented using medium-sized HOG templates, and
are therefore similar to the "parts" of DPM. However, the training generally
uses weaker supervision (e.g., image-level labels), and no spatial layout is
assumed.

Mining Mid-level Elements: For discovering mid-level elements, we use the
formulation of [14], which uses a discriminative extension of mean shift.
They formalize the idea of "frequent yet informative" by attempting to find
regions of patch feature space that satisfy two properties: (1) the region is
populated by a reasonable number of patches; and (2) the ratio between the
positive and negative patches is maximized in the region. Essentially this
corresponds to finding the local maxima of an estimate of the density ratio
between positives and negatives.

Figure 5. Examples of detections in the PASCAL VOC 2007 test set (in each
case, we show only the top detection in the image). Images outlined in yellow
denote false positives.

We use this approach to mine a set of N mid-level elements for each category,
where N ∈ {100, 200, 300, 500}. These elements are mined using the
ground-truth training-set boxes (each dilated by 25% of its size), which act
as positives, and images not containing the object as negatives. To further
improve the localization and reduce confusion arising out of sharing between
similar categories, we also mine 50 elements per category such that they have
an overlap (IoU) greater than 0.8 with the ground-truth boxes (see Table 2).

Feature Representation: We now use these mid-level elements to generate a
representation for region proposals. To construct the feature vector for each
region proposal, a HOG pyramid for the region is extracted, and then a
sliding-window operation is done within the pyramid using these mid-level
elements (regardless of category). We then max-pool the responses of each
element across different scales using a 2-level spatial pyramid (1 × 1 and
2 × 2 grids) [32, 26] as shown in Figure 2. These 5 pooling regions, N
elements per category and c categories make an (N × 5 × c)-dimensional
feature vector. We also experimented with another pooling scheme, where we
pool in 7 regions (1 × 1, 1 × 3, and 3 × 1 grids) (as shown in Figure 4).
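The pooling step can be illustrated with a minimal Python sketch. It assumes that the element responses inside one proposal are already available as an array of shape (number of elements, height, width), maximised over scales; the function names, the array layout and the way cell boundaries are rounded are assumptions made for illustration and are not taken from the implementation used in the paper.

```python
import numpy as np

def pool_regions(scheme="5-region"):
    """Pooling cells as (y0, y1, x0, x1) fractions of the proposal box."""
    if scheme == "5-region":
        grids = [(1, 1), (2, 2)]          # 1 + 4 = 5 cells
    elif scheme == "7-region":
        grids = [(1, 1), (1, 3), (3, 1)]  # 1 + 3 + 3 = 7 cells
    else:
        raise ValueError(f"unknown pooling scheme: {scheme}")
    cells = []
    for rows, cols in grids:
        for r in range(rows):
            for c in range(cols):
                cells.append((r / rows, (r + 1) / rows, c / cols, (c + 1) / cols))
    return cells

def pool_proposal(responses, scheme="5-region"):
    """Max-pool element responses over every cell of the chosen scheme.

    responses: array (num_elements, H, W) of detector scores inside one
    proposal (hypothetical layout). Returns a vector with
    num_elements * num_cells entries, ordered cell by cell.
    """
    _, height, width = responses.shape
    feats = []
    for y0, y1, x0, x1 in pool_regions(scheme):
        ys, ye = int(np.floor(y0 * height)), int(np.ceil(y1 * height))
        xs, xe = int(np.floor(x0 * width)), int(np.ceil(x1 * width))
        feats.append(responses[:, ys:ye, xs:xe].max(axis=(1, 2)))
    return np.concatenate(feats)
```

With all elements of all categories stacked along the first axis, the output length is the number of elements times the number of pooling cells, which matches the (N × 5 × c) count given in the text for 5-region pooling.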

Implementation Details: For speedy feature extraction, we construct a single
feature pyramid for an entire image and then extract responses for each
region proposal from this whole-image pyramid (cf. [26]). For the patch-level
features, we follow [14], where each 64 × 64 pixel window is represented by a
6 × 6 × 31 HOG descriptor and a 6 × 6 × 2 color image (down-sampled ab
channels of the corresponding Lab image), resulting in a 6 × 6 × 33 feature.
For the HOG feature pyramid, we use 4 scales per octave during training (for
efficiency) and 8 scales per octave during testing (for accuracy) (cf. [21]).
We up-sample images by a factor of 2 when evaluating proposals smaller than
80 × 80 pixels.
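As a small worked illustration of these numbers, the sketch below computes the length of one patch descriptor and lists pyramid scale factors. The text only states the number of scales per octave; the geometric spacing 2^(-i/k) used here is the common DPM-style convention and is an assumption, as are the function names.

```python
def patch_feature_length(cells=6, hog_channels=31, color_channels=2):
    """Length of one patch descriptor: cells x cells x (HOG + color) values."""
    return cells * cells * (hog_channels + color_channels)

def pyramid_scales(scales_per_octave, num_octaves):
    """Scale factors of an image pyramid, assuming geometric spacing.

    Level i is assumed to be resized by 2 ** (-i / scales_per_octave), so the
    image size halves every scales_per_octave levels.
    """
    levels = scales_per_octave * num_octaves
    return [2.0 ** (-i / scales_per_octave) for i in range(levels)]
```

Under this assumption, moving from 4 to 8 scales per octave doubles the number of pyramid levels that have to be scanned for the same range of object sizes, which is consistent with using the coarser setting for training and the finer one for testing.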

3.3. Object detection using mid-level elements 

Given a feature representation for a region proposal, we use class-specific
classifiers [25, 50] to predict whether a proposal belongs to a particular
category or not. We post-process the output of these classifiers to remove
overlapping detections via non-max suppression (NMS) [21, 25] and improve
localization via bounding box regression [25].

Class-specific classifiers: We train a simple 1-vs-all linear SVM [25, 50]
for each category. During training, we use all ground-truth bounding boxes
(and their flipped versions) as positives for their respective classifiers,
and any window with IoU < 0.2 with a ground-truth box for a given category as
a negative for that category (all other windows are discarded). We found that
only one iteration of hard negative mining was sufficient for convergence.
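The selection of negative windows can be sketched as follows. This is an illustrative reading of the rule above, not the authors' code: boxes are assumed to be (x1, y1, x2, y2) tuples, and a window is assumed to count as a negative when its best overlap with any ground-truth box of the category is below the threshold. Positives are the ground-truth boxes themselves and are not handled here.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_window(window, gt_boxes, neg_thresh=0.2):
    """Label one proposal window for one category.

    Returns "negative" when the best IoU with that category's ground-truth
    boxes is below neg_thresh, otherwise "discard". An image without any
    ground-truth box of the category yields only negatives.
    """
    best = max((iou(window, gt) for gt in gt_boxes), default=0.0)
    return "negative" if best < neg_thresh else "discard"
```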

NMS: NMS [21, 25] works by iteratively selecting the highest-scoring proposal
from the pool of candidates from an image, and then removing all candidates
with IoU greater than a given threshold (0.3 in our case) with the selected
proposal [25].
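The greedy procedure just described can be written as a short Python sketch. The box format (x1, y1, x2, y2) and the function names are assumptions for illustration; only the selection rule and the 0.3 threshold come from the text.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.3):
    """Greedy non-max suppression; returns indices of the kept boxes."""
    # Candidates sorted from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining candidate that overlaps the selected box
        # by more than the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

In a detection setting this would typically be run separately for each category on the scored proposals of one image.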

Bounding Box Regression: The bounding box regression (BBReg) [25] model is a
class-specific regressor which aims to improve localization. It learns a
transformation function F which maps a proposal's features to the associated
ground-truth bounding box. F is assumed to be a linear function of the
proposal's features, where the output space is a 4-vector that defines (1) x-
and (2) y-translation of the bounding box's upper-left corner (scaled by the
input box's width and height respectively), as well as (3) x- and (4)
y-scaling factors for the width and height of the bounding box, in log space.
Our implementation follows [25], except that we replace CNN features with our
mid-level features.
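The 4-vector parameterization can be made concrete with a minimal Python sketch that follows the description above literally (translation of the upper-left corner, log-space scaling of width and height). The learned linear map F from features to this 4-vector is not shown; the function names and the (x1, y1, x2, y2) box format are assumptions for illustration.

```python
import math

def bbox_targets(box, gt):
    """Regression targets that map a proposal box onto a ground-truth box."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1
    dx = (gx1 - x1) / w    # x-translation of the corner, in units of width
    dy = (gy1 - y1) / h    # y-translation of the corner, in units of height
    dw = math.log(gw / w)  # width scaling, in log space
    dh = math.log(gh / h)  # height scaling, in log space
    return (dx, dy, dw, dh)

def apply_bbox_deltas(box, deltas):
    """Apply a predicted 4-vector (dx, dy, dw, dh) to a proposal box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx, dy, dw, dh = deltas
    new_x1 = x1 + dx * w
    new_y1 = y1 + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (new_x1, new_y1, new_x1 + new_w, new_y1 + new_h)
```

At training time the targets from `bbox_targets` would serve as regression outputs for F; at test time the predicted 4-vector is passed to `apply_bbox_deltas` to obtain the refined box.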

4. Experiments 

We now discuss our experimental results on the standard PASCAL VOC-2007 and
VOC-2010 [20] datasets for object detection. We also perform an extensive
ablative analysis to understand how various design choices impact the
performance.

Figure 6. Reconstructing using mid-level representations. First, we compute
average images for our elements. Then given the detections, we can visualize
why the detections occurred. We display these element-level averages
positioned over the element detections which contributed the most to the
detection score (as measured by the detection score times the feature weight
in the category SVM), weighted according to the contribution. This highlights
which parts contributed the most in detecting a particular object.


4.1. Performance on VOC-2007 

First, we compare our approach with several baselines on the VOC-2007 comp3
challenge (no extra data) [20]. Compared to DPM [21] (33.7% mAP), which also
uses HOG, our algorithm achieves an mAP of 41.9%, a boost of approximately 8%
(absolute). This is a significant improvement, and clearly demonstrates the
utility of a mid-level layer for object detection. Interestingly, our
algorithm's performance is comparable to the state-of-the-art, even though we
do not use any segmentation (as used by SegDPM [23]) or context [10]. We did,
however, use bounding-box regression from the R-CNN [25] framework, which we
found provides a 3% boost in mAP. We also found that the 7-region pooling
works slightly better than the 5-region pooling (see Section 3.2), especially
when fewer elements are used (e.g., when using top-100 elements, 5-region
pooling gives 33% mAP while 7-region pooling gives 33.7% mAP).

Qualitative Analysis: Mid-level elements provide a number of convenient ways
to understand the behavior of our algorithm. First, we aim to see which
mid-level elements are useful for the task of detection. In Figure 3 we show
the elements that received the highest (or lowest) weights in the final
class-specific SVM. We first show two elements with the highest positive
weight, and one with the largest negative weight. Note, for example, that the
most discriminative elements for bicycles (as chosen by our mid-level
representation) are wheels, yet the SVM has a strong negative weight for bus
wheels; this is likely to prevent bus wheels from being confused with bicycle
wheels. Furthermore, dining-table elements receive a strong negative weight
in the person classifier, probably because a person bounding box containing
too much of a table is likely to result in a poor localization.

We also show some representative detections in Figure 5. A predominant
failure mode of our algorithm seems to be localization error, specifically
where multiple instances of the same category (e.g. two birds, multiple
people, or bottle) occur together. We attribute this to the relatively
aggressive pooling scheme in our feature vector. One way to combat this kind
of error would be to include more spatial information in the feature vector;
however, we leave this investigation for future work.

VOC 2007 test                  aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv    mAP
DPM-v5 [21]                    33.2   60.3   10.2   16.1   27.3   54.3   58.2   23.0   20.0   24.1   26.7   12.7   58.1   48.2   43.2   12.0   21.1   36.1   46.0   43.5   33.7
SS SPM [50]                    43.5   46.5   10.4   12.0    9.3   49.4   53.7   39.4   12.5   36.9   42.2   26.4   47.0   52.4   23.5   12.1   29.9   36.3   42.2   48.8   33.7
RM^C [16]                      37.7   61.4   12.7   17.6   29.9   55.1   56.3   29.5   24.6   28.2   30.7   21.2   59.5   51.5   40.3   14.3   23.9   41.6   49.2   46.0   36.6
[10] (w/o context)             52.6   52.6   19.2   25.4   18.7   47.3   56.9   42.1   16.6   41.4   41.9   27.7   47.9   51.5   29.9   20.0   41.1   36.4   48.6   53.2   38.5
Regionlets [56]                54.2   52.0   20.3   24.0   20.1   55.5   68.7   42.6   19.2   44.2   49.1   26.6   57.0   54.5   43.4   16.4   36.6   37.7   59.4   52.3   41.7
RCNN-Scratch [1]               49.9   60.6   24.7   23.7   20.3   52.5   64.8   32.9   20.4   43.5   34.2   29.9   49.0   60.4   47.5   28.0   42.3   28.6   51.2   50.0   40.7
5-Region Pooling               50.7   58.3   16.6   26.2   24.2   56.4   57.2   44.9   18.8   39.9   43.5   27.3   44.5   49.4   26.8   19.4   35.3   41.4   47.8   47.4   38.8
5-Region + BBReg               52.0   60.9   17.1   26.4   25.7   59.3   60.9   44.9   20.6   42.7   46.6   30.4   57.1   49.7   32.5   19.9   38.0   42.3   53.0   50.3   41.5
7-Region Pooling               49.2   58.3   16.4   25.6   22.5   55.2   57.6   47.0   19.3   39.9   44.8   28.2   44.5   50.6   31.1   21.1   35.6   35.8   47.0   48.8   38.9
7-Region + BBReg               51.7   61.5   17.9   27.0   24.0   57.5   60.2   47.9   21.1   42.2   48.9   29.8   58.3   51.9   34.3   22.2   36.8   40.2   54.3   50.9   41.9

Table 1. Results on VOC-2007: We use top-500 + 50 elements for our approach
(last 4 rows).

VOC 2007 test                  aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv    mAP
top-100                        43.8   52.8   11.8   18.9   21.8   52.0   54.5   38.7   14.9   32.4   39.0   23.1   34.8   39.4   24.7   16.2   28.6   31.8   39.0   42.5   33.0
top-100 + 50                   45.8   54.3   13.1   21.7   22.1   53.2   55.9   39.6   15.6   33.3   41.7   23.0   37.9   42.4   26.7   16.5   34.1   33.9   42.0   46.4   35.0
top-200                        47.2   55.3   12.5   23.4   21.8   55.4   56.1   39.2   16.3   37.4   44.0   25.0   40.0   42.3   24.6   17.3   28.8   35.0   44.8   45.4   35.6
top-200 + 50                   48.1   56.1   13.2   23.1   22.4   54.8   57.1   40.3   17.4   39.9   42.2   24.6   41.3   45.9   25.8   17.4   31.0   36.6   46.4   45.8   36.5
top-300                        48.5   56.7   13.2   22.8   24.0   54.9   56.6   41.2   18.6   36.3   42.5   27.3   39.0   44.6   25.8   19.0   32.1   38.9   45.1   45.8   36.6
top-300 + 50                   49.7   56.5   14.1   24.3   24.1   56.1   56.7   42.4   18.6   39.5   43.2   28.7   42.1   48.9   26.7   19.8   33.4   39.7   47.6   46.8   37.9
top-500                        49.6   57.2   16.1   25.4   23.9   55.6   56.6   42.9   18.7   37.9   45.7   27.9   42.7   49.5   27.0   18.6   35.8   37.1   47.5   47.3   38.2
top-500 + 50                   50.7   58.3   16.6   26.2   24.2   56.4   57.2   44.9   18.8   39.9   43.5   27.3   44.5   49.4   26.8   19.4   35.3   41.4   47.8   47.4   38.8
top-200 [45]                   38.2   52.0    5.8   15.9   17.5   46.1   53.2   36.3   12.5   30.3   35.3   19.2   32.4   40.9   22.6   13.7   19.4   26.7   36.7   35.9   29.5

Table 2. Ablation Analysis: We use 5-region pooling (1 × 1 and 2 × 2) to
analyze the detection performance with the number of mid-level elements. We
also analyze the influence of adding 50 elements corresponding to IoU > 0.8
with ground-truth boxes (Section 3.2).

Finally, we highlight the information captured by our representation for a
few detected objects in Figure 6. For each element, we first average the
top-10 detections from the training set to get a representative image. Then
for each detected object, we take the 20 highest-scoring mid-level elements,
and transfer their representative images to the locations where these
elements were detected. Then we take the weighted mean of these transfers to
get the final visualization (Figure 6). Note, for example, how representative
wheel elements are for vehicles, and face elements for cats and dogs (which
is in line with the observation by [39]).

4.2. Ablative Analysis and Detection Diagnosis 

We now perform ablative analysis to understand how different components
influence the performance of our system. First we investigate the effects of
increasing the number of mid-level elements. For this experiment, we use the
5-region pooling scheme (Section 3.2). As can be seen from Table 2, the
performance of our system consistently increases with the number of mid-level
elements.

We also compared the performance of our approach when we use the mid-level
elements generated by [45]. Our results indicate that the elements obtained
using discriminative mode-seeking [14] are better suited for object
detection.

Finally, we use the diagnostic framework from [27] to better understand the
failure modes of our system^1. The key take-away is that in the case of
person, the localization error is quite significant; this is likely due to
our detections encompassing multiple instances of the object (see Figure 5).

4.3. Performance on PASCAL VOC-2010

We now compare the performance of our approach on the VOC-2010 comp3
challenge (no extra data) [20] with several standard baselines, including the
state-of-the-art (see Table 3). In this experiment, we used top-500 elements
per category and performed 5-region pooling for feature representation. Our
approach achieves 37.1% mAP, and outperforms the standard HOG-based DPM [21]
(without context) by more than 5% (absolute)^2. We also compared our approach
to Boosted Collection of Parts (BCP) [18] and to Poselets [6], which are also
based on similar ideas of using mid-level elements. Compared to [18], our
approach has a significant boost of 12% (absolute). The mAP for 18 categories
obtained using Poselets [6] (chair and table were not available) is 29.6%,
whereas our mAP is 38.9% for those categories. Note that our approach does
not use any contextual re-scoring as done in SegDPM [23], but still achieves
comparable results. Our approach is also comparable to Regionlets [56], which
uses a combination of HOG, LBP and covariance features.


^1 The full diagnostic report is available on the authors' website.

^2 DPM [21] (with BB-Reg and without context) achieves 30.8% mAP as reported
in [18], and DPM-v5 [21] (with BB-Reg and context) achieves 33.4% as reported
on the authors' website.
VOC 2010 test                  aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv    mAP
DPM-v5 (w/o context) [21]      45.6   49.0   11.0   11.6   27.2   50.5   43.1   23.6   17.2   23.2   10.7   20.5   42.5   44.5   41.3    8.7     29   18.7   40.0   34.5   29.6
DPM-v5 [21]                    49.2   53.8   13.1   15.3   35.5   53.4   49.7   27.0   17.2   28.8   14.7   17.8   46.4   51.2   47.7   10.8   34.2   20.7   43.8   38.3   33.4
SS SPM [50]                    56.2   42.4   15.3   12.6   21.8   49.3   36.8   46.1   12.9   32.1   30.0   36.5   43.5   52.9   32.9   15.3   41.1   31.8   47.0   44.8   35.1
BCP [18]                       44.3   35.2    9.7   10.1   15.1   44.6   32.0   35.3    4.4   17.5   15.0   27.6   36.2   42.1   30.0    5.0   13.7   18.8   34.4   28.6   25.0
Poselet [6]                    33.2   51.9    8.5    8.2   34.8   39.0   48.8   22.2      -   20.6      -   18.5   48.2   44.1   48.5    9.1   28.0   13.0   22.5   33.0      -
RM^C [16]                      49.8   50.6   15.1   15.5   28.5   51.1   42.2   30.5   17.3   28.3   12.4   26.0   45.6   51.8   41.4   12.6   30.4   26.1   44.0   37.6   32.8
[10] (w/o context)             61.3   46.4   21.1   21.0   18.1   49.3   45.0   46.9   12.8   29.2   26.1   38.9   40.4   53.1   31.9   13.3   39.9   33.4   43.0   45.3   35.8
SegDPM [23]                    56.4   48.0   24.3   21.8   31.3   51.3   47.3   48.2   16.1   29.4   19.0   37.5   44.1   51.5   44.4   12.6   32.0   28.8   48.9   39.1   36.6
SegDPM+rescore [23]            58.7   51.4   25.3   24.1   33.8   52.5   49.2   48.8   11.7   30.4   21.6   37.7   46.0   53.1   46.0   13.1   35.7   29.4   52.5   41.8   38.1
Regionlets [56]                65.0   48.9   25.9   24.6   24.5   56.1   54.5   51.2   17.0   28.9   30.2   35.8   40.2   55.7   43.5   14.3   43.9   32.6   54.0   45.9   39.6
top-500 (5-Region)             55.1   50.8   16.7   18.3   22.6   50.4   44.9   48.3   10.3   27.7   25.6   35.8   43.3   49.9   27.6   14.3   34.2   31.4   43.8   41.7   34.6
top-500 (5-Region + BBReg)     60.8   52.4   17.7   18.9   25.2   51.6   47.6   49.1   11.5   32.1   27.7   36.9   46.2   53.6   30.9   16.5   36.2   31.2   51.4   43.3   37.1

Table 3. Results on VOC-2010: We use top-500 elements and 5-region pooling
for this experiment (last 2 rows).

5. Discussion

Our work, even though focused on HOG-based mid-level elements, shares some
insights with the current CNN-based methods. [1] showed how learning from
large amounts of data is one of the strengths of deep networks: when the
convolutional network is pre-trained on ImageNet data (i.e., 1M images)
[13, 31], the performance on PASCAL is significantly higher than when the
same network is trained on PASCAL images only (54.2% vs. 40.7% mAP on
VOC-2007). But it is interesting that the deep network trained only on PASCAL
data still outperforms the canonical DPM [21] (33.7% mAP) by a reasonable
margin (7% absolute). These multi-layer CNNs share data across categories to
learn features. The simple mid-level representation we build and investigate
in this paper also enables sharing between categories (which was notably
missing in most HOG-based pipelines) and allows for encoding loose spatial
constraints. We believe that these are the main reasons we are able to bridge
the performance gap between CNN and HOG pipelines (even though our
representation uses the same features as DPM).

A concurrent work [36] presented an approach to discover similar mid-level
elements using CNN features, and achieved state-of-the-art performance on the
task of scene classification. We believe that our work can also utilize these
CNN-feature-based mid-level elements for object detection, and that this
would be interesting future work. Further, we hope that our work will inspire
future research on combining mid-level elements [45, 14] with deep
architectures (such as learning a hierarchy of mid-level representations).

The current mid-level discovery approaches [45, 14, 36, 18, 29] are not
easily scalable to millions of images, the main bottleneck being dense
sliding window mining (detection in a HOG feature pyramid for
[45, 14, 18, 29], and dense deep-feature extraction for [36]). We are
optimistic that the methods developed to scale dense sliding window object
detection [42, 2, 12, 22] will help scale up current mid-level approaches in
the near future.

6. Conclusion 

We have presented a surprisingly simple, yet effective, approach for object
detection which builds upon the recent success of discriminative mid-level
elements. This simple representation performs comparably to the
state-of-the-art on the PASCAL VOC comp3 detection challenge. We also
demonstrate that this representation is easily interpretable, in the sense
that we can understand what the final classifier has learned, and visualize
what the representation "sees" when it detects or mis-detects an object. We
hope this will inspire further research on mid-level representations.

Acknowledgements: This work was partially supported by ONR MURI
N000141010934. AS, CD and AG were partially supported by Microsoft Research
PhD Fellowship, Google PhD Fellowship